Dynamic escalation of service conditions

ABSTRACT

Systems, methods, and software are provided for dynamically escalating service conditions associated with data center failures. In one implementation, a monitoring system detects a service condition. The service condition may be indicative of a failure of at least one service element within a data center monitored by the monitoring system. The monitoring system determines whether or not the service condition qualifies for escalation based at least in part on an access condition associated with the data center. The access condition may be identified by at least another monitoring system that is located in a geographic region distinct from that of the first monitoring system. Upon determining that the service condition qualifies for escalation, the monitoring system escalates the service condition to an escalated condition and initiates an escalated response.

TECHNICAL FIELD

Aspects of the disclosure are related to computing technologies, and in particular, to data center monitoring and service condition escalation.

TECHNICAL BACKGROUND

Data centers are installations used to host a wide variety of computing applications and associated data, such as email, social networking, search engine, business analytics, productivity, and gaming applications. End users typically engage these applications by way of devices connected to data centers over the Internet, although other ways of connecting are possible. With the increase in cloud computing, data centers have become even more prevalent as of late.

Most data centers are housed in facilities with redundant communication links, power supplies, and other infrastructure elements that allow for nearly continuous operation. Nevertheless, sophisticated monitoring systems are often employed to monitor data center operations. In many situations, monitoring systems external to the data centers communicate with service elements installed within, such as hardware or software resources, to report on the status of service elements, including when they fail. Some monitoring systems provide for the automated repair or recovery of failed service elements.

However, some failures require the attention of staff personnel to varying degrees. For example, when a repair or recovery operation is unsuccessful with respect to a failed service element, staff may be alerted to address the failure manually. When those failures occur, staff can be notified accordingly by way of emails, pages, phone calls, or the like. Large scale failures, such as a regional power outage or natural disaster, may inhibit communication between the monitoring systems and the service elements within a data center, causing associated personnel to be notified.

OVERVIEW

Provided herein are systems, methods, and software for dynamically escalating service conditions associated with data center failures. In one implementation, a monitoring system detects a service condition. The service condition may be indicative of a failure of at least one service element within a data center monitored by the monitoring system. The monitoring system determines whether or not the service condition qualifies for escalation based at least in part on an access condition associated with the data center. The access condition may be identified by at least another monitoring system that is located in a geographic region distinct from that of the monitoring system. Upon determining that the service condition qualifies for escalation, the monitoring system escalates the service condition to an escalated condition and initiates an escalated response.

This Overview is provided to introduce a selection of concepts in a simplified form that are further described below in the Technical Disclosure. It should be understood that this Overview is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

BRIEF DESCRIPTION OF THE DRAWINGS

Many aspects of the disclosure can be better understood with reference to the following drawings. While several implementations are described in connection with these drawings, the disclosure is not limited to the implementations disclosed herein. On the contrary, the intent is to cover all alternatives, modifications, and equivalents.

FIG. 1 illustrates a monitoring environment in an implementation.

FIG. 2 illustrates a method of operating a monitoring system within a monitoring environment in an implementation.

FIG. 3 illustrates a sequence diagram pertaining to operations of a monitoring environment in an implementation.

FIG. 4 illustrates another monitoring environment in an implementation.

FIG. 5 illustrates a sequence diagram pertaining to the operation of a monitoring environment in an implementation.

FIG. 6 illustrates a sequence diagram pertaining to the operation of a monitoring environment in an implementation.

FIG. 7 illustrates a sequence diagram pertaining to the operation of a monitoring environment in an implementation.

FIG. 8 illustrates a sequence diagram pertaining to the operation of a monitoring environment in an implementation.

FIG. 9 illustrates a monitoring system in an implementation.

TECHNICAL DISCLOSURE

Implementations described herein provide for improved monitoring and alerting with respect to data center operations. In particular, monitoring environments disclosed herein provide for dynamically escalating service conditions based on access conditions related to a data center. In this manner, a service condition may be escalated to an escalated condition and an escalated response initiated thereto. In contrast, a service condition that is not escalated may be responded to by way of a non-escalated response.

In a brief example, large scale failures, and other types of escalated conditions, may be detected from the occurrence of service conditions within a data center and an evaluation of access conditions associated with the data center. Escalated conditions can be attended with escalated responses, while more mundane failures that previously may have triggered escalated responses can be handled in a non-escalated manner.

In some implementations, a monitoring system detects a service condition indicative of a failure of at least one service element within a data center monitored by the monitoring system. The monitoring system determines whether or not the service condition qualifies for escalation based at least in part on an evaluation of an access condition associated with the data center. The monitoring system can carry out the evaluation in a variety of ways, including attempting to access the data center itself. In addition, the monitoring system may communicate with other monitoring systems to inquire as to their ability to access the data center. In such a case, the other monitoring systems may be geographically remote from the monitoring system and possibly the data center. In this manner, the monitoring system can ascertain if the data center is generally inaccessible, which may indicate the occurrence of a large scale failure or some other event that calls for escalated handling. Upon determining that the service condition qualifies for escalation, the monitoring system escalates the service condition to an escalated condition and initiates an escalated response.

Referring now to the drawings, FIG. 1 illustrates a monitoring environment in which a monitoring process, described in FIG. 2, may be implemented. FIG. 3 illustrates a sequence of operations within the monitoring environment of FIG. 1. FIG. 4 illustrates another monitoring environment, while FIGS. 5-8 generally illustrate the operation of the monitoring environment in FIG. 4. FIG. 9 illustrates a monitoring system representative of those found in the monitoring environments of FIGS. 1 and 4.

Turning to FIG. 1, monitoring environment 100 includes monitoring system 101 and monitoring system 103, located in region 102 and region 104 respectively. Monitoring system 101 and monitoring system 103 are capable of communicating with data center 121, and service elements 123 and 125 within data center 121, over communication network 111. Data center 121 is located in region 122. Regions 102, 104, and 122 are representative of areas sufficiently distinct from each other that the communication path between monitoring system 101 and data center 121 has at least one link or hop not in common with the communication path between monitoring system 103 and data center 121. In some implementations, none of the links in each respective communication path are shared in common between monitoring system 101 and monitoring system 103. Examples of regions 102, 104, and 122 are regions that are geographically distinct from or otherwise different than each other, such as cities, states, provinces, countries, or continents, or any other type of geographically distinguishable region.
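The path-distinctness property described above can be sketched in a few lines. This is only an illustration: representing a communication path as an ordered list of node names, and the names `links_of`, `has_unshared_link`, and `fully_disjoint`, are assumptions made for the example and do not come from the disclosure.

```python
from typing import Sequence, Set, Tuple

# Hypothetical representation: a link is an ordered pair of adjacent nodes.
Link = Tuple[str, str]


def links_of(path: Sequence[str]) -> Set[Link]:
    """Return the set of links (consecutive node pairs) along a path."""
    return set(zip(path, path[1:]))


def has_unshared_link(path_a: Sequence[str], path_b: Sequence[str]) -> bool:
    """True if path_a contains at least one link that path_b does not."""
    return bool(links_of(path_a) - links_of(path_b))


def fully_disjoint(path_a: Sequence[str], path_b: Sequence[str]) -> bool:
    """True if the two paths share no links at all."""
    return links_of(path_a).isdisjoint(links_of(path_b))


# Illustrative paths from each monitoring system to data center 121.
path_101 = ["ms101", "router_a", "core", "dc121"]
path_103 = ["ms103", "router_b", "core", "dc121"]
print(has_unshared_link(path_101, path_103))  # the paths differ in some link
print(fully_disjoint(path_101, path_103))     # but they share the last link
```

In this sketch the two paths satisfy the weaker requirement (at least one link not in common) while sharing their final link, so they would not satisfy the stricter variant in which no links are shared.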

Monitoring system 101 is any computing system capable of monitoring at least some aspects of service element 123 or service element 125, or both. Moreover, monitoring system 101 is any computing system capable of detecting and escalating service conditions as will be discussed in more detail with respect to FIG. 2. Similarly, monitoring system 103 is any computing system capable of monitoring at least some aspects of service element 123 or service element 125, or both. Monitoring system 900, discussed in more detail below with respect to FIG. 9, is an example of a suitable system for implementing monitoring system 101 and monitoring system 103.

FIG. 2 illustrates monitoring process 200, which may be implemented by either of monitoring system 101 or monitoring system 103. For illustrative purposes, the discussion of FIG. 2 will proceed with respect to an implementation of monitoring process 200 by monitoring system 101.

To begin, monitoring system 101 detects a service condition associated with a service element within data center 121, such as service element 123 or 125 (step 201). Monitoring system 101 may execute various monitoring processes that evaluate information normally provided by service elements 123 and 125. The monitoring processes may be capable of processing the information to generate and report on service conditions associated with service elements 123 and 125. The service condition may be communicated to monitoring system 101 by the service element, and thus is detected by monitoring system 101 upon processing communications indicative of the service condition. However, it should be understood that the service condition may be detected by monitoring system 101 without need for communication with the service element. For example, the monitoring processes may also consider the lack or absence of communication by the service element when generating the service conditions.
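The two detection routes described above (an explicit failure report, or the absence of communication) might be sketched as follows. The `ServiceCondition` type, the `detect_service_conditions` function, the timestamp bookkeeping, and the 60-second silence limit are all hypothetical choices for illustration; the disclosure does not prescribe any of them.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ServiceCondition:
    element_id: str
    reason: str


def detect_service_conditions(
    last_report_time: Dict[str, float],
    reported_failures: Dict[str, str],
    now: float,
    silence_limit: float = 60.0,  # assumed value, in seconds
) -> List[ServiceCondition]:
    """Generate service conditions from reports or from their absence."""
    conditions = []
    for element_id in sorted(last_report_time):
        if element_id in reported_failures:
            # The service element itself communicated a failure.
            conditions.append(
                ServiceCondition(element_id, reported_failures[element_id])
            )
        elif now - last_report_time[element_id] > silence_limit:
            # Nothing was communicated for too long; treat silence as a condition.
            conditions.append(
                ServiceCondition(element_id, "no communication received")
            )
    return conditions
```

For instance, calling the function with a recent report from one element that carries a failure status and a stale timestamp for another element would yield one condition of each kind.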

Upon detecting the service condition, monitoring system 101 determines whether or not the service condition qualifies for escalation to an escalated condition representative of more than just the failure of a service element (step 203). An escalated condition, relative to a non-escalated condition, may be considered any condition representative of a problem having a greater scale than problems associated with non-escalated conditions. For instance, a data center-wide outage may be considered an escalated condition relative to the failure of just a single server machine within the data center. However, a variety of conditions may be considered escalated conditions. For instance, the failure of a substantial proportion of a data center may be considered an escalated condition. Another distinction between escalated and non-escalated conditions may be the variation in responses to the two kinds of conditions. For instance, an escalated condition may call for a more rapid response than a non-escalated condition. In another example, an escalated condition may result in alerting a greater number of personnel than a non-escalated condition. It should be understood that many conditions may be considered escalated conditions beyond just those provided for exemplary purposes herein.

Monitoring system 101 may make this determination based on a variety of factors, including an evaluation of access to data center 121. The evaluation of access to data center 121 may include testing the access between monitoring system 101 and data center 121, as well as communicating with monitoring system 103 to inquire about the condition of access between monitoring system 103 and data center 121. If the service condition qualifies for escalated handling based on the access condition of data center 121, then the service condition is handled in an escalated manner accordingly (step 203). For example, the service condition may be escalated to an escalated condition and an escalated response initiated. However, it is possible that the access condition is such that the service condition is not escalated and can be handled with a non-escalated response.

For example, if monitoring system 101 is able to confirm that data center 121 is accessible, then the service condition need not be escalated. This determination may be made because the service condition can be considered to be caused by a failure or sub-optimal performance of one of service elements 123 or 125, rather than a large scale failure generally impacting access to data center 121. Monitoring system 101 may discover the accessibility of data center 121 by way of an access test performed by monitoring system 101 with respect to data center 121.

In another example, monitoring system 101 may not be able to access data center 121, as discovered by its access test, but monitoring system 103 may report back to monitoring system 101 that data center 121 is accessible. Monitoring system 103 may also discover the accessibility by performing an access test with respect to data center 121. Monitoring system 101 can then determine to handle the service condition in a non-escalated manner based on the access condition of data center 121.

In yet another example, monitoring system 101 may be unable to determine the access condition of data center 121 from either its own access test or the access test performed by monitoring system 103 with respect to data center 121. This may occur when monitoring system 101 is unable to communicate with data center 121 itself, but may also occur when monitoring system 101 is also unable to communicate with monitoring system 103. A communication failure between monitoring system 101 and monitoring system 103 may result in an undetermined access condition of data center 121 since monitoring system 101 is not able to communicate with monitoring system 103.

Under such circumstances, monitoring system 101 may be programmed or otherwise configured to respond in a variety of ways. In one scenario, monitoring system 101 may be configured to escalate the service condition since an inability to communicate with data center 121 and monitoring system 103 may be indicative of a large scale failure that requires escalated attention.

In an alternative scenario, monitoring system 101 may be configured not to escalate the service condition since an inability to communicate with either data center 121 or monitoring system 103 may be indicative of a problem localized to monitoring system 101. For example, a failure may have occurred with respect to communication links incoming to or outgoing from monitoring system 101, inhibiting its ability to communicate, while monitoring system 103 and data center 121 may be operating sufficiently.
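The escalation determination walked through in the preceding examples can be condensed into a small decision function. This is a minimal sketch under stated assumptions: the function name `should_escalate`, the use of `None` to mean that the other monitoring system could not be reached, and the `escalate_when_undetermined` flag that selects between the two configurable scenarios are all hypothetical.

```python
from typing import Optional


def should_escalate(
    own_access_ok: bool,
    peer_access_ok: Optional[bool],
    escalate_when_undetermined: bool = True,
) -> bool:
    """Decide whether a detected service condition should be escalated.

    own_access_ok:   result of this monitoring system's own access test.
    peer_access_ok:  result reported by the other monitoring system, or
                     None if that system could not be reached.
    """
    if own_access_ok:
        # The data center is reachable, so the failure is likely confined
        # to a service element rather than the data center as a whole.
        return False
    if peer_access_ok is None:
        # Access condition is undetermined; fall back to configured policy.
        return escalate_when_undetermined
    # Escalate only if the peer also finds the data center inaccessible.
    return not peer_access_ok
```

Setting `escalate_when_undetermined=False` corresponds to the alternative scenario in which a loss of contact with both the data center and the other monitoring system is attributed to a local communication problem.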

In one implementation, a count of service conditions that may indicate a failure of a service element can be tracked. Determining if the service condition qualifies for escalation can occur when the count satisfies a threshold, such as meeting or exceeding a threshold count. In other words, while each single service condition may be evaluated for escalation, the existence of a single such service condition may not justify the resources used to determine if the service condition should be escalated. Rather, the effort may be put forth in response to detecting a certain number, quantity, or volume of service conditions indicative of a failure of various service elements.
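A count-based gate of this kind could look like the following sketch. The class name `ConditionCounter` and the threshold value used in the example are assumptions for illustration; the text does not specify how the count is stored or reset.

```python
class ConditionCounter:
    """Tracks service conditions and signals when evaluation is warranted."""

    def __init__(self, threshold: int) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.count = 0

    def record(self) -> bool:
        """Record one service condition.

        Returns True once the count meets or exceeds the threshold, meaning
        the escalation evaluation (access tests and so on) should be run.
        """
        self.count += 1
        return self.count >= self.threshold


counter = ConditionCounter(threshold=3)  # assumed threshold
results = [counter.record() for _ in range(4)]
print(results)  # [False, False, True, True]
```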

FIG. 3 illustrates sequence diagram 300 pertaining to a sequence of operations in monitoring environment 100. As illustrated, monitoring system 101 may exchange monitoring communications with service element 123 during an operational period. For example, service element 123 may report on various operating conditions, such as processor utilization, application usage, and disk usage, as well as other operational parameters that can be monitored. The operating conditions may be communicated to monitoring system 101 in response to queries made by monitoring system 101. However, service element 123 may also push the operational information without prompting or querying by monitoring system 101.

During operation, monitoring system 101 detects a service condition indicative of a failure of service element 123. For example, service element 123 may itself communicate a failure status to monitoring system 101, such as the failure of an application, a hardware element, or some other resource on or associated with service element 123. In another example, service element 123 may fail to communicate with monitoring system 101, which is represented internally within monitoring system 101 as a service condition. In other words, the lack or absence of monitoring communications by service element 123 may be indicative of a failure of service element 123 or any of its component aspects.

In response to detecting the service condition, monitoring system 101 attempts an access test with respect to data center 121 to evaluate whether or not data center 121 can be accessed communicatively by monitoring system 101. In this illustration, the access test fails, indicating either that data center 121 may be inaccessible in general, or that a communication problem has occurred locally with respect to monitoring system 101, inhibiting it from communicating with data center 121.
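One simple way to realize an access test is to attempt a TCP connection to an endpoint at the data center and treat any connection error or timeout as a failed test. The disclosure does not specify the mechanism, so the TCP approach, the `access_test` name, and the default timeout below are assumptions; only the standard-library `socket.create_connection` call is used.

```python
import socket


def access_test(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened.

    Any OSError (refused connection, unreachable network, timeout) is
    treated as a failed access test.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A failed result from this function alone does not distinguish a data center outage from a local link problem, which is why the result would be combined with observations from other monitoring systems.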

In order to ascertain whether the access test failed due to a general problem with data center 121 or a localized problem with the communication ability of monitoring system 101, monitoring system 101 initiates a communication with monitoring system 103, located in a geographic area distinct from where monitoring system 101 is located, to determine how monitoring system 103 may observe access to data center 121.

Monitoring system 103 responsively initiates its own access test with respect to data center 121. In this illustration, the access test initiated by monitoring system 103 also fails. Monitoring system 103 communicates the access condition of data center 121, as observed by monitoring system 103, to monitoring system 101 for consideration in the evaluation of whether or not to escalate the service condition. It should be understood that the access test performed by monitoring system 103 may return results different from an access test performed by monitoring system 101 for a variety of reasons. For example, the relative differences or variations inherent to the communication paths linking monitoring system 103 to data center 121 and monitoring system 101 to data center 121 may cause substantially different results. This may especially be the case where at least a portion of one or the other communication path has failed.

Continuing with this illustration, monitoring system 101 is able to evaluate the access condition with respect to data center 121 based not only on its own access test, such as a ping test, but also on the access test performed by monitoring system 103. It should be understood that monitoring system 101 may communicate with other monitoring systems in addition to monitoring system 103. Monitoring system 101 can consider the access condition as reported by each monitoring system when determining whether or not to escalate the service condition.
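When several monitoring systems report, their observations have to be combined into a single access condition. The sketch below uses one possible rule, which is an assumption rather than something the disclosure mandates: any successful access test means the data center is accessible, failures from every responding system mean it is inaccessible, and no usable reports mean the condition is undetermined. The function name `combine_access_reports` and the string labels are hypothetical.

```python
from typing import Mapping, Optional


def combine_access_reports(
    own_result: bool,
    peer_results: Mapping[str, Optional[bool]],
) -> str:
    """Combine access test results into a single access condition.

    peer_results maps a peer identifier to True (accessible), False
    (inaccessible), or None (the peer could not be reached).
    """
    if own_result or any(r is True for r in peer_results.values()):
        return "accessible"
    if any(r is False for r in peer_results.values()):
        return "inaccessible"
    return "undetermined"


# Own test failed and the one peer also reports failure.
print(combine_access_reports(False, {"103": False}))  # inaccessible
```

Under this rule, a single peer that can still reach the data center is enough to attribute the local failure to a localized problem.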

In this example, the service condition is escalated to an escalated condition. An escalated response is taken to respond to the escalated condition. For example, alerts may be communicated to personnel responsible for responding to escalated conditions. In contrast, had it been determined that the service condition need not be escalated, a non-escalated response may have been chosen to respond to the service condition. For example, a repair or recovery action may have been initiated, or even a wait period initiated, to address the failure of the associated service elements.

Turning to FIG. 4, monitoring environment 400 includes monitoring system 401, monitoring system 403, and monitoring system 405. Monitoring system 401 is located in region 402, while monitoring system 403 and monitoring system 405 are located in region 404 and region 406 respectively. Monitoring environment 400 also includes data center 421 and data center 431. Monitoring systems 401, 403, and 405 are capable of communicating with data center 421 and data center 431 over communication network 410. Data center 421 is located in region 422 and data center 431 is located in region 432.

Regions 402, 404, 406, 422, and 432 are representative of areas sufficiently distinct from each other that the communication paths between monitoring systems 401, 403, and 405, and data centers 421 and 431, each have at least one unique link or hop included therein. In this way, the result of access tests performed by any one monitoring system may be useful to any other monitoring system when evaluating an access condition associated with a data center. Examples of regions 402, 404, 406, 422, and 432 are any regions that are geographically distinct from or otherwise different than each other, such as cities, states, provinces, countries, or continents, or any other type of geographically distinguishable region.

Data center 421 includes access system 426, service element 423, and service element 425. Access system 426 provides elements external to data center 421 with access to service elements 423 and 425. For example, monitoring systems 401, 403, and 405 may communicate with service elements 423 and 425 through access system 426. In addition, other computing devices, such as mobile phones, desktop computers, laptop computers, and tablet computers may communicate with elements within data center 421 through access system 426 when engaging with services, applications, or data within data center 421.

Data center 431 includes access system 436, service element 433, and service element 435. Access system 436 provides elements external to data center 431 with access to service elements 433 and 435. For example, monitoring systems 401, 403, and 405 may communicate with service elements 433 and 435 through access system 436. In addition, other computing devices, such as mobile phones, desktop computers, laptop computers, and tablet computers may communicate with elements within data center 431 through access system 436 when engaging with services, applications, or data within data center 431.

Communication network 410 may be any network or collection of networks capable of carrying communications between monitoring systems 401, 403, and 405 and data centers 421 and 431. For illustrative purposes, communication network 410 includes paths 411, 413, 415, 417, and 419, which are representative of the various networks, systems, sub-systems, links, or other such segments of communication network 410 used to deliver communications to monitoring systems 401, 403, and 405 located in different geographic regions, namely regions 402, 404, and 406. For instance, communications originating from or destined to monitoring system 401 may traverse path 411, while communications originating from or destined to monitoring system 403 may traverse path 413.

Further illustrated in FIG. 4, monitoring system 401 may include several service modules that can be called in response to a detected service condition, including an auto-recovery module 407 and a staff alert module 408. It should be understood that monitoring system 401 may include more or fewer modules than those illustrated herein. In either case, at least two modules may be present that are capable of handling service conditions according to at least an escalated service response and a non-escalated service response. For example, staff alert module 408 may be considered capable of implementing an escalated service response relative to a non-escalated service response implemented by auto-recovery module 407.
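The relationship between the two modules can be illustrated with a small dispatcher. The function names, the string return values, and the `respond` helper are hypothetical stand-ins; a real auto-recovery or staff alert module would perform repair actions or send pages, calls, and emails rather than return text.

```python
from typing import Callable


def auto_recovery_module(condition: str) -> str:
    """Stand-in for a non-escalated response, such as a repair attempt."""
    return f"auto-recovery initiated for: {condition}"


def staff_alert_module(condition: str) -> str:
    """Stand-in for an escalated response, such as alerting on-call staff."""
    return f"staff alerted about: {condition}"


def respond(condition: str, escalated: bool) -> str:
    """Route a service condition to the module matching its escalation state."""
    handler: Callable[[str], str] = (
        staff_alert_module if escalated else auto_recovery_module
    )
    return handler(condition)


print(respond("service element 423 not reporting", escalated=False))
print(respond("data center 421 inaccessible", escalated=True))
```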

FIG. 5 illustrates sequence diagram 500 pertaining to the operation of monitoring environment 400 in an implementation. To begin, monitoring system 401 detects a service condition indicative of a failure of a service element within data center 421. Initially, the service condition may call for a non-escalated response, such as initiating a repair or recovery process provided by auto-recovery module 407. However, monitoring system 401 first determines whether or not to escalate the service condition to an escalated condition by initiating an access test with respect to data center 421. In this example, the access test fails.

The access test may fail for a number of reasons. For instance, path 411 may be degraded or otherwise inoperable, thereby rendering monitoring system 401 incapable of communication with data center 421 and service elements 423 and 425 residing therein. However, the status of path 411 may not yet be ascertained by monitoring system 401. Thus, monitoring system 401 next attempts to communicate with monitoring system 403 and monitoring system 405 to determine the condition of access to data center 421 as determined by each monitoring system 403 and 405 performing its own access test.

As illustrated, both monitoring system 403 and monitoring system 405 are able to successfully perform access tests with respect to data center 421 and determine the access condition therefrom. Accordingly, monitoring system 403 and monitoring system 405 communicate their respective views of the access condition to monitoring system 401 for consideration in determining whether or not to escalate the service condition.

In this example, monitoring system 401 determines not to escalate the service condition based on the access condition of data center 421 communicated by monitoring system 403 and monitoring system 405. Note that since monitoring system 403 and monitoring system 405 are able to communicate with data center 421, monitoring system 401 can determine that its inability to communicate with data center 421 may be a localized problem specific to monitoring system 401. The service condition can therefore be handled by auto-recovery module 407 implementing a suitable non-escalated service response.

In an alternative, it is possible that the service condition need not be addressed at all. For example, if it is positively determined that the service condition is caused by a communication fault within or related to monitoring system 401, then it may be that data center 421 is operating sufficiently. In other words, there may be no actual problems associated with service element 423 or service element 425 requiring the attention of either an escalated or non-escalated service response.

In another alternative, the service condition may be addressed by attending to whatever communication fault may have caused the service condition. For example, auto-recovery module 407 may still be called, but it may be in reference to a process or element within monitoring system 401 or aspects of path 411 inhibiting monitoring system 401 from communicating effectively with data center 421.

FIG. 6 illustrates another sequence diagram 600 pertaining to the operation of monitoring environment 400 in an implementation. To begin, monitoring system 401 detects a service condition that requires handling according to a non-escalated service response. To determine whether or not to handle the service condition according to an escalated or non-escalated service response, monitoring system 401 initiates an access test with respect to data center 421. In this example, the access test fails.

The access test may fail for a number of reasons. For instance, path 411 may be degraded or otherwise inoperable, thereby rendering monitoring system 401 incapable of communication with data center 421. However, the status of path 411 may not yet be ascertained by monitoring system 401. Thus, monitoring system 401 next attempts to communicate with monitoring system 403 and monitoring system 405 to determine the condition of access to data center 421 as determined by each monitoring system performing its own access test.

In this illustration, the communications attempted between monitoring system 401 and monitoring systems 403 and 405 also fail, rendering monitoring system 401 unable to learn of the condition of access to data center 421 as observed by monitoring systems 403 and 405. Since monitoring system 401 is unable to evaluate the condition of access to data center 421, the service condition is escalated. Staff alert module 408 is called, thereby launching alerts to on-call personnel or other staff identified as responsible for the service condition. For instance, automated phone calls, pages, or emails may be generated and transmitted informing the personnel about the service condition.

FIG. 7 illustrates another sequence diagram 700 pertaining to the operation of monitoring environment 400 in another implementation. To begin, monitoring system 401 detects a service condition that requires handling according to a service response. To determine whether or not to handle the service condition according to an escalated or non-escalated service response, monitoring system 401 initiates an access test with respect to data center 421. In this example, the access test fails.

The access test may fail for a number of reasons. For instance, path 411 may be degraded or otherwise inoperable, thereby rendering monitoring system 401 incapable of communication with data center 421. However, the status of path 411 may not yet be ascertained by monitoring system 401. Thus, monitoring system 401 next attempts to communicate with monitoring system 403 and monitoring system 405 to determine the condition of access to data center 421 as determined by each monitoring system performing its own access test.

As illustrated, monitoring system 403 is able to successfully perform an access test with respect to data center 421 and determine the access condition therefrom. Accordingly, monitoring system 403 communicates its view of the access condition, namely accessible, to monitoring system 401 for consideration in determining whether or not to escalate the service condition. However, monitoring system 405 is unable to successfully perform an access test with respect to data center 421. This may occur due to a variety of reasons, including an operational fault internal to monitoring system 405 or a communication fault on path 415 or path 417, as well as for any number of other reasons. Thus, monitoring system 405 communicates the access condition of data center 421 as inaccessible.

In this example, monitoring system 401 determines not to escalate the service condition based on the access condition of data center 421 communicated by monitoring system 403 and monitoring system 405. Note that, since monitoring system 403 is able to communicate with data center 421, monitoring system 401 can determine that its inability to communicate with data center 421 may be a localized problem specific to monitoring system 401 or monitoring system 405. The service condition can therefore be handled by auto-recovery module 407 implementing a suitable non-escalated service response.

In an alternative, it is possible that the service condition need not be addressed at all. For example, if it is determined that the service condition is caused by a communication fault within or related to monitoring system 401 or monitoring system 405, then it may be that data center 421 is operating sufficiently. In other words, there may be no actual problems associated with service element 423 or service element 425 requiring the attention of either an escalated or non-escalated service response.

In another alternative, the service condition may be addressed by attending to whatever communication fault may have caused the service condition. For example, auto-recovery module 407 may still be called, but it may be in reference to a process or element within monitoring system 401 or aspects of path 411 inhibiting monitoring system 401 from communicating effectively with data center 421.

FIG. 8 illustrates another sequence diagram 800 pertaining to the operation of monitoring environment 400 in an implementation. To begin, monitoring system 401 detects a service condition that requires handling according to a service response. To determine whether to handle the service condition according to an escalated or a non-escalated service response, monitoring system 401 initiates an access test with respect to data center 421. In this example, the access test fails.

The access test may fail for a number of reasons. For instance, path 411 may be degraded or otherwise inoperable, thereby rendering monitoring system 401 incapable of communication with data center 421. However, the status of path 411 may not yet be ascertained by monitoring system 401. Thus, monitoring system 401 next attempts to communicate with monitoring system 403 and data center 431 to determine the condition of access to data center 421.

In this illustration, the communications attempted between monitoring system 401 and monitoring system 403, and between monitoring system 401 and data center 431, fail, rendering monitoring system 401 unable to learn of the condition of access to data center 421 as observed by either monitoring system 403 or data center 431. Since monitoring system 401 is unable to evaluate the condition of access to data center 421, the service condition is escalated. Staff alert module 408 is called, thereby launching alerts to on-call personnel or other staff identified as responsible for the service condition. For instance, automated phone calls, pages, or emails may be generated and transmitted informing the personnel about the service condition.
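As a non-limiting sketch of the sequence just described, the following shows one way a monitoring system might escalate when none of its peers can be reached. The function names, the use of `ConnectionError` to model a failed communication, and the alert text are assumptions made for illustration; `alert_staff` merely stands in for a component such as staff alert module 408.

```python
from typing import Callable, Dict


def collect_peer_views(peers: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Ask each peer for its view of access; skip peers that cannot be reached."""
    views = {}
    for name, query in peers.items():
        try:
            views[name] = query()
        except ConnectionError:
            # Communication with this peer failed, so no view is recorded.
            continue
    return views


def handle_after_failed_access_test(peers, alert_staff) -> str:
    """Escalate when the access condition cannot be evaluated at all."""
    views = collect_peer_views(peers)
    if not views:
        alert_staff("Service condition escalated: access to the data center "
                    "could not be evaluated")
        return "escalated"
    # Otherwise the collected views remain to be evaluated by other logic.
    return "evaluate-peer-views"


def unreachable() -> str:
    # Hypothetical stand-in for a peer that cannot be contacted.
    raise ConnectionError("peer did not respond")


alerts = []
result = handle_after_failed_access_test(
    {"monitoring_system_403": unreachable, "data_center_431": unreachable},
    alerts.append,
)
print(result)  # escalated
```

Treating an unknown access condition as grounds for escalation errs on the side of notifying personnel, at the cost of occasional alerts caused by faults local to the monitoring system.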

Referring now to FIG. 9, a monitoring system 900 suitable for implementing monitoring process 200 is illustrated. Monitoring system 900 is generally representative of any computing system or systems suitable for implementing a monitoring system, such as monitoring system 101, 103, 401, 403, and 405. Examples of monitoring system 900 include any suitable computer or computing system, including server computers, virtual machines, computing appliances, and distributed computing systems, as well as any other combination or variation thereof.

Monitoring system 900 includes processing system 901, storage system 903, software 905, and communication interface 907. Processing system 901 is operatively coupled with storage system 903 and communication interface 907. Processing system 901 loads and executes software 905 from storage system 903, including monitoring process 200. When executed by monitoring system 900 in general, and processing system 901 in particular, software 905 directs monitoring system 900 to operate as described herein for monitoring process 200.

Monitoring system 900 may optionally include additional devices, features, or functionality. For example, monitoring system 900 may optionally have input devices, such as a keyboard, a mouse, a voice input device, a touch input device, a gesture input device, or other comparable input devices. Output devices such as a display, speakers, printer, and other types of comparable output devices may also be included. These devices are well known in the art and need not be discussed at length here.

Referring still to FIG. 9, processing system 901 may comprise a microprocessor and other circuitry that retrieves and executes software 905 from storage system 903. Processing system 901 may be implemented within a single processing device but may also be distributed across multiple processing devices or sub-systems that cooperate in executing program instructions. Examples of processing system 901 include general purpose central processing units, application specific processors, and logic devices, as well as any other type of processing device, combinations of processing devices, or variations thereof.

Storage system 903 may comprise any storage media readable by processing system 901 and capable of storing software 905. Storage system 903 may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data. Storage system 903 may be implemented as a single storage device but may also be implemented across multiple storage devices or sub-systems. Storage system 903 may comprise additional elements, such as a controller, capable of communicating with processing system 901.

Examples of storage media include random access memory, read only memory, magnetic disks, optical disks, flash memory, virtual memory, and non-virtual memory, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and that may be accessed by an instruction execution system, as well as any combination or variation thereof, or any other type of storage media. In some implementations, the storage media may be a non-transitory storage media. In some implementations, at least a portion of the storage media may be transitory. It should be understood that in no case is the storage media a propagated signal.

Software 905 includes monitoring process 200, which may be implemented in program instructions that, when executed by monitoring system 900, direct monitoring system 900 to detect service conditions, evaluate access conditions with respect to a data center, and determine whether or not to escalate the service conditions based on the access conditions.

Software 905 may include additional processes, programs, or components in addition to monitoring process 200, such as operating system software or other application software. Software 905 may also comprise firmware or some other form of machine-readable processing instructions capable of being executed by processing system 901.

In general, software 905 may, when loaded into processing system 901 and executed, transform processing system 901, and monitoring system 900 overall, from a general-purpose computing system into a special-purpose computing system customized to facilitate dynamic escalation of service conditions as described herein for each implementation. Indeed, encoding software 905 on storage system 903 may transform the physical structure of storage system 903. The specific transformation of the physical structure may depend on various factors in different implementations of this description. Examples of such factors may include, but are not limited to, the technology used to implement the storage media of storage system 903 and whether the computer-storage media are characterized as primary or secondary storage.

For example, if the computer-storage media are implemented as semiconductor-based memory, software 905 may transform the physical state of the semiconductor memory when the program is encoded therein. For example, software 905 may transform the state of transistors, capacitors, or other discrete circuit elements constituting the semiconductor memory. A similar transformation may occur with respect to magnetic or optical media. Other transformations of physical media are possible without departing from the scope of the present description, with the foregoing examples provided only to facilitate this discussion.

Through the operation of monitoring system 900 employing software 905, transformations may be performed with respect to monitoring process 200. As an example, monitoring system 900 could be considered transformed from one state to another by the handling of service conditions. In a first state, a service condition may be detected that would normally call for handling with a non-escalated service response. Upon determining a particular access condition of a data center, it may be determined that the service condition should be escalated to an escalated condition and requires handling with an escalated service response, thereby changing monitoring system 900 to a second, different state.

Referring again to FIG. 9, communication interface 907 may include communication connections and devices that allow for communication between monitoring system 900 and other monitoring systems and data centers over a communication network. For example, monitoring system 101 communicates with monitoring system 103 and data center 121 over communication network 111. Examples of connections and devices that together allow for inter-system communication include network interface cards, antennas, power amplifiers, RF circuitry, transceivers, and other communication circuitry. The aforementioned network, connections, and devices are well known and need not be discussed at length here.

In an operational scenario involving a data center hosting instances of an application, a monitoring system external to the data center may detect an application condition, of several application conditions monitored by the monitoring system, indicative of a failure of at least one instance of the application running within the data center. The monitoring system responsively determines if the application condition qualifies for escalation based at least in part on an access condition associated with the data center identified by another monitoring system located in a geographic region distinct from that of the monitoring system. Upon determining that the service condition qualifies for escalation, the monitoring system escalates the service condition from the application condition to a data center condition indicative of a large scale failure of the data center.

Upon determining that the service condition qualifies for escalation, the monitoring system may initiate an escalated response to the data center condition. In addition, upon determining that the service condition does not qualify for escalation, the monitoring system may initiate a non-escalated response to the service condition.

Optionally, initiating the escalated response to the data center condition may include generating and transmitting notifications of the large scale failure of the data center for presentation to personnel responsible for handling the large scale failure of the data center. Initiating the non-escalated response may involve initiating a repair or a recovery of the instance of the application and, responsive to a failure of the repair or the recovery of the instance of the application, generating and transmitting a notification of the failure of the instance of the application to personnel responsible for handling the failure of the instance of the application.
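For purposes of illustration, the non-escalated response described above might be sketched as follows. The function `non_escalated_response`, its parameters, and the instance identifier are hypothetical; the callable `attempt_recovery` stands in for whatever repair or recovery mechanism an implementation provides and is assumed to return True on success.

```python
from typing import Callable


def non_escalated_response(
    instance_id: str,
    attempt_recovery: Callable[[str], bool],
    notify: Callable[[str], None],
) -> bool:
    """Try to repair or recover an application instance; notify only on failure."""
    recovered = attempt_recovery(instance_id)
    if not recovered:
        # The repair or recovery failed, so responsible personnel are notified.
        notify(f"Recovery failed for application instance {instance_id}")
    return recovered


notifications = []
# Hypothetical recovery routine that always fails, to exercise the notify path.
ok = non_escalated_response("app-instance-1", lambda _id: False, notifications.append)
print(ok, len(notifications))  # False 1
```

Under this arrangement personnel are contacted only after automated recovery has been attempted, which keeps alerts for single-instance failures distinct from the notifications generated for a large scale failure of the data center.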

The functional block diagrams, operational sequences, and flow diagrams provided in the Figures are representative of exemplary architectures, environments, and methodologies for performing novel aspects of the disclosure. While, for purposes of simplicity of explanation, the methodologies included herein may be in the form of a functional diagram, operational sequence, or flow diagram, and may be described as a series of acts, it is to be understood and appreciated that the methodologies are not limited by the order of acts, as some acts may, in accordance therewith, occur in a different order and/or concurrently with other acts from that shown and described herein. For example, those skilled in the art will understand and appreciate that a methodology could alternatively be represented as a series of interrelated states or events, such as in a state diagram. Moreover, not all acts illustrated in a methodology may be required for a novel implementation.

The included descriptions and figures depict specific implementations to teach those skilled in the art how to make and use the best mode. For the purpose of teaching inventive principles, some conventional aspects have been simplified or omitted. Those skilled in the art will appreciate variations from these implementations that fall within the scope of the invention. Those skilled in the art will also appreciate that the features described above can be combined in various ways to form multiple implementations. As a result, the invention is not limited to the specific implementations described above, but only by the claims and their equivalents.

What is claimed is:
 1. A method for dynamically escalating service conditions associated with data center operations, the method comprising: detecting at least a service condition, of a plurality of service conditions monitored by at least a first monitoring system, indicative of a failure of at least a service element of a plurality of service elements within a data center; determining if the service condition qualifies for escalation based at least in part on an access condition associated with the data center identified by at least a second monitoring system located in a geographic region distinct from that of the first monitoring system, wherein the access condition is indicative of an accessibility between the second monitoring system and the data center; upon determining that the service condition qualifies for escalation, escalating the service condition to an escalated condition and initiating an escalated response to the escalated condition.
 2. The method of claim 1 further comprising upon determining that the service condition does not qualify for escalating, initiating a non-escalated response to the service condition.
 3. The method of claim 1 further comprising transferring a request for the access condition from the first monitoring system for delivery to the second monitoring system, responsively initiating an access test in the second monitoring system to determine the access condition, and transferring a reply indicative of the access condition as determined by the access test for consideration in determining if the service condition qualifies for escalation.
 4. The method of claim 3 further comprising transferring another request for the access condition from the first monitoring system for delivery to a third monitoring system, responsively initiating another access test in the third monitoring system to determine the access condition, and transferring another reply indicative of the access condition as determined by the another access test for consideration in determining if the service condition qualifies for escalation.
 5. The method of claim 4 further comprising responsive to detecting the service condition, initiating an initial access test in the first monitoring system to determine the access condition associated with the data center, and wherein transferring the request for the access condition from the first monitoring system for delivery to the second monitoring system comprises transferring the request in response to a result of the initial access test indicating the access condition as inaccessible or undetermined.
 6. The method of claim 1 wherein determining if the service condition qualifies for escalation occurs when a count of any of the plurality of service conditions that indicate failures of any of the plurality of service elements satisfies a threshold, and wherein the escalated condition comprises a large scale failure of the data center.
 7. The method of claim 1 wherein each of the plurality of service elements comprises an instance of an application hosted within the data center and wherein the plurality of service conditions comprise an absence of responses by the service element to queries initiated by the first monitoring system related to performance of the service element.
 8. The method of claim 1 wherein each of the plurality of service elements comprises an instance of an application hosted within the data center and wherein the plurality of service conditions comprises an absence of reporting related to performance of the service element scheduled to be generated and transferred by the service element to the first monitoring system.
 9. One or more computer readable storage devices having stored thereon program instructions for dynamically escalating service conditions, wherein the program instructions, when executed by a first monitoring system, direct the first monitoring system to at least: detect at least a service condition, of a plurality of service conditions monitored by at least the first monitoring system, indicative of a failure of at least a service element of a plurality of service elements within a data center; determine if the service condition qualifies for escalation based at least in part on an access condition associated with the data center identified by at least a second monitoring system located in a geographic region distinct from that of the first monitoring system, wherein the access condition is indicative of an accessibility between the second monitoring system and the data center; upon determining that the service condition qualifies for escalation, escalate the service condition to an escalated condition and initiate an escalated response to the escalated condition.
 10. The one or more computer readable storage devices of claim 9 wherein the program instructions further direct the first monitoring system to initiate a non-escalated response to the service condition upon determining that the service condition does not qualify for escalation.
 11. The one or more computer readable storage devices of claim 9 wherein the program instructions, when executed by the first monitoring system, further direct the first monitoring system to transfer a request for the access condition for delivery to the second monitoring system to initiate an access test to determine the access condition, and receive a reply indicative of the access condition as determined by the access test for consideration in determining if the service condition qualifies for escalation.
 12. The one or more computer readable storage devices of claim 11 wherein the program instructions, when executed by the first monitoring system, further direct the first monitoring system to transfer another request for the access condition for delivery to a third monitoring system to initiate another access test to determine the access condition, and receive another reply indicative of the access condition as determined by the another access test for consideration in determining if the service condition qualifies for escalation.
 13. The one or more computer readable storage devices of claim 12 wherein the program instructions, when executed by the first monitoring system, further direct the first monitoring system to, responsive to detecting the service condition, initiate an initial access test to determine the access condition associated with the data center, and wherein direct the first monitoring system to transfer the request for the access condition for delivery to the second monitoring system in response to a result of the initial access test indicating the access condition as inaccessible or undetermined.
 14. The one or more computer readable storage devices of claim 9 wherein the program instructions, when executed by the first monitoring system, direct the first monitoring system to determine if the service condition qualifies for escalation in response to when a count of any of the plurality of service conditions that indicate failures of any of the plurality of service elements satisfies a threshold, and wherein the escalated condition comprises a large scale failure of the data center.
 15. The one or more computer readable storage devices of claim 9 wherein each of the plurality of service elements comprises an instance of an application hosted within the data center and wherein the plurality of service conditions comprise an absence of responses by the service element to queries initiated by the first monitoring system related to performance of the service element.
 16. The one or more computer readable storage devices of claim 9 wherein each of the plurality of service elements comprises an instance of an application hosted within the data center and wherein the plurality of service conditions comprise an absence of reporting related to performance of the service element scheduled to be generated and transferred by the service element to the first monitoring system.
 17. A method of operating a monitoring system in a monitoring environment to dynamically escalate service conditions associated with data center operations, the method comprising: detecting at least an application condition, of a plurality of application conditions monitored by at least the monitoring system, indicative of a failure of at least an instance of an application of a plurality of instances of the application running within a data center; determining if the application condition qualifies for escalation based at least in part on an access condition associated with the data center identified by at least another monitoring system located in a geographic region distinct from that of the monitoring system, wherein the access condition is indicative of an accessibility between the second monitoring system and the data center; upon determining that the application condition qualifies for escalation, escalating the application condition from the application condition to a data center condition indicative of a large scale failure of the data center.
 18. The method of claim 17 further comprising: upon determining that the application condition qualifies for escalation, initiating an escalated response to the data center condition; and upon determining that the application condition does not qualify for escalating, initiating a non-escalated response to the application condition.
 19. The method of claim 18 wherein initiating the escalated response to the data center condition comprises generating and transmitting notifications of the large scale failure of the data center for presentation to personnel responsible for handling the large scale failure of the data center.
 20. The method of claim 19 wherein initiating the non-escalated response comprises initiating a repair or a recovery of the instance of the application and, responsive to failure of the repair or the recovery of the instance of the application, generating and transmitting a notification of the failure of the instance of the application to personnel responsible for handling the failure of the instance of the application.